De novo transcriptomes of six calanoid copepods (Crustacea): a resource for the discovery of novel genes

This study presents eight new high-quality de novo transcriptomes from six co-occurring species of calanoid copepods, the first published for Neocalanus plumchrus, N. cristatus, Eucalanus bungii and Metridia pacifica and additional ones for N. flemingeri and Calanus marshallae. They are ecologically-important members of sub-arctic North Pacific marine zooplankton communities. ‘Omics data for this diverse and numerous taxonomic group are sparse and difficult to obtain. Total RNA from single individuals was used to construct gene libraries that were sequenced on an Illumina Next-Seq platform. Quality filtered reads were assembled with Trinity software and validated using multiple criteria. The study’s primary purpose is to provide a resource for gene expression studies. The integrated database can be used for quantitative inter- and intra-species comparisons of gene expression patterns across biological processes. An example of an additional use is provided for discovering novel and evolutionarily-significant proteins within the Calanoida. A workflow was designed to find and characterize unannotated transcripts with homologies across de novo assemblies that have also been shown to be eco-responsive.


Background & Summary
Transcriptomics, the application of high-throughput sequencing of millions of short nucleotide sequences generated from mRNA, has transformed studies of the physiological ecology of marine organisms [1][2][3][4][5] . Gene expression profiles of one or more species have resulted in new insights into physiological acclimatization, metabolic trade-offs and resilience, causes of high-mortality events and niche partitioning between similar species. However, progress in the application of environmental transcriptomics has been hampered by a scarcity of genomic resources, especially for co-occurring zooplankton species. Furthermore, existing references for gene expression studies are of variable quality and many genes, including eco-responsive genes remain unannotated 5 . Here, we present eight de novo transcriptomes for six calanoid copepod species that co-occur across the Sub-arctic North Pacific. The goals were to (1) generate high-quality assemblies, (2) assess their value as references for cross-species comparisons; and (3) to develop a workflow to search for and identify novel predicted proteins that are unique to copepods.
Calanoida is an order within the Copepoda, the second largest sub-class of the Crustacea. Calanoids are small (≤15 mm) free-living, mostly planktonic organisms that inhabit marine and freshwaters worldwide 6,7 . Abundant and ecologically important, they are a key link between phytoplankton and the upper trophic levels, such as larval fishes and other planktivores including invertebrates, sea birds and marine mammals 7 . Relative to their importance they are understudied: their natural habitats are difficult to access and few species are amenable to laboratory cultivation. Nevertheless, calanoids are known for many unique adaptations (high transparency, fluorescent proteins, bioluminescence, myelin, unique sensory structures) that have set them apart from other better studied arthropods [8][9][10][11][12] . These cellular properties depend on the evolution of novel proteins. Thus, it is not surprising that de novo assemblies of copepods always include a significant number of predicted proteins 1 Pacific Biosciences Research Center, University of Hawai'i at Mānoa, 1993 East-West Rd., Honolulu, HI, 96822, USA. 2 Integrative Marine Ecology Department, Stazione Zoologica Anton Dohrn, Naples, Italy. ✉ e-mail: vittoria.roncalli@ szn.it

DATA DesCRiPTOR
OPeN Methods Work-flow. A diagram of the workflow used in this study is presented in Fig. 1. De novo transcriptomes were assembled from short-read sequences generated from RNA extracted from field-collected individuals. Each assembly was assessed for quality using multiple indicators, and assembled transcripts were annotated based on sequence similarity. Cross-species comparisons focused on identifying annotated and unannotated orthologous genes among the six species. In downstream analysis, selected unannotated genes were examined for their eco-responsiveness, their presence as orthologs across taxonomic groups and functional motifs in their predicted proteins.  Table 1 lists the collection date, station and depth stratum for each individual. Zooplankton collections were made using either a QuadNet with two 150 µm and two 53 µm mesh nets (April collection), or a multiple opening and closing plankton net (0.25 m 2 cross-sectional area; 150 μm mesh nets; Multinet-Midi, Hydro-Bios; September collections). The QuadNet net was towed vertically from 100 to 0 m, while the Multinet was towed either vertically or obliquely sampling depth strata between 700 m and the surface. The live zooplankton collections were immediately diluted into ambient seawater and maintained at collection temperatures (5 -7 °C). Copepods were removed from the diluted samples using a small ladle, and sorted under a dissection microscope to select individuals from the target species (Table 1). Briefly, live and undamaged individuals were identified and staged using morphological criteria, with species identification being confirmed through the COI sequence in the assembled transcriptomes. Identified copepods were isolated using a wide-bore pipette into a dish with filtered seawater before transferring them into microcentrifuge tubes (1.6 ml) with 0.5 ml of RNAlater Stabilization Reagent (QIAGEN). Preserved copepods were frozen first in −20 °C during the cruises, and then transferred to −80 °C until further processing.  Table 1. Summary of Alaskan calanoid species and collection details used to generate the RNA-Seq data. Taxonomic classification of species and superfamily designation according to Andronov (1974) 33 . State of myelination according to Davis et al. (1999) and Lenz  Functional annotation. Assemblies were functionally annotated against the NCBI Swiss-Prot protein and UniProt databases. Initial annotations were obtained by using the BLASTx algorithm on a local BLAST webserver with a Beowulf cluster using the Swiss-Prot protein database (downloaded February 2021) as reference and a threshold E-value of 10 −5 . Transcripts with BLAST annotations were then searched against the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway databases using UniProt 32 .

Fig. 1
Flow chart of the processing of individual copepods for RNA-seq and data analyses. After processing short sequence reads (150 bp, paired-end) for quality and Trinity assembly, de novo transcriptomes were assessed for quality (left), annotated (center) and compared across species (right). The flow on the right branch focuses on the analysis of full-length predicted proteins and includes the identification of orthologous genes and the search for novel proteins.
www.nature.com/scientificdata www.nature.com/scientificdata/ species confirmation and contamination testing. The annotation results were also used to confirm species identity and to provide an initial assessment of possible contaminants. Sequences annotated as "COX1" (cytochrome oxidase subunit 1 or "COI") were retrieved from each file, followed by a confirmatory BLASTn into the NCBI nucleotide collection (nr/nt) database. BLAST results for each top hit were checked for a species match, transcript coverage, E-value and percent identity (>97% identity required). To search further for fragments of foreign COI sequences, an artificial "species filter" that quantifies such contaminants 33 , was constructed from the full-length COI sequences of each of the species presented here, as well as likely contaminants (See Supplementary Information SD1, Part II). Raw reads from each transcriptome were mapped against the artificial one, and the counts attracted to the COI of each species were used as a measure of contamination level. In addition, each assembly was searched manually for ribosomal RNA sequences. The BLASTn algorithm was run using species-specific 18 S rRNA sequences as queries to: (1) obtain the native sequence(s) most closely matching the rRNA reference; and (2) to search for rRNA sequences indicating possible foreign contamination. MAFFT (v. 7.511) alignments were used to identify sequences with identities in the low 90% range or below ("% ID to native" column in Supplementary Information SD2) as candidates for foreign rRNA. Such candidates were then checked for taxon identification and magnitude of contamination using a reciprocal BLAST (BLASTn) and relative abundance in the count file generated by Bowtie2.
Protein prediction. The assembled transcriptomes were further validated by tallying the numbers of predicted proteins they contained and the homologies among them, both within and between species. For each assembly, open reading frames (ORFs) were identified using TransDecoder (v. 5.5.0; setting: report only the single best ORF) 29 . The resulting translated transcriptomes were filtered for predicted full-length ("complete") proteins.
Homolog clustering. Prevalence of homologs, especially among closely-related species, was used as a further validation of transcriptome quality. Groups of predicted proteins with homologies among selected combinations of the new transcriptomes were generated using OrthoVenn2, a web-based tool designed to find sets of homologous proteins, termed "clusters" 34 . When homology clusters needed a common reference, we included a vetted transcriptome previously-published for a stage CV N. flemingeri (NCBI accession series GHLB01000000, BioProject PRJNA496596) 3 . To estimate the reproducibility of separate assemblies of the same species, we generated two intraspecies cluster sets, one for N. flemingeri (n = 3) and one for C. marshallae (n = 2). To estimate the extent of homologies between transcriptomes of more distantly-related species, we generated heterospecific cluster sets for each of the assemblies paired with the N. flemingeri reference transcriptome. Finally, to survey the consistency of homologies across different phylogenetic breadths in these transcriptomes, we generated cluster sets from multi-species selections of transcriptomes representing four taxonomic categories: within genus (n = 3, Neocalanus), within family (n = 4, Calanidae: Neocalanus species plus C. marshallae), within the Myelinata (n = 5, Calanidae + E. bungii), and within order (n = 6, Calanoida: Myelinata + M. pacifica).
Bioprospecting for novel proteins. To illustrate the potential for these transcriptomes to be useful in discovering novel proteins responsive to changes in environmental conditions, we focused on multi-species clusters of homologs for which OrthoVenn2 failed to find annotations in SwissProt. Starting with the clusters in the 6-species Calanoida set just described, which we will refer to as the "primary set, " we selected three secondary subsets of clusters having the three taxonomic coverages just described: Neocalanus, Calanidae, and Myelinata. For each of these subsets, we retrieved from OrthoVenn2 a "clusters shared by all" list -a list of homologous clusters containing representation from all of the selected species (see Supplementary Information SF1 for examples). While these subsets contained all clusters in the primary set with homologs (orthologs) shared among the targeted secondary species, some of the clusters also included homologs from outside of the targeted taxa. To restrict the list of candidate clusters to a taxonomically homogeneous set that lacked such "foreign" homologs, we searched the lists and removed any such outsiders; i.e. from the Myelinata subset we removed clusters that contained homologs in the amyelinate M. pacifica transcriptome; from the Calanidae subset, M. pacifica and E. bungii-containing clusters; and from the Neocalanus subset, M. pacifica, E. bungii and C. marshallae clusters.
The resulting lists in each of the three taxonomic categories, which are enriched for potential novel and taxonomically-unique proteins, were then further refined by searching for environmentally-regulated proteins. Each list was compared with lists of translated non-annotated N. flemingeri genes that had been identified in previous studies as differentially expressed in response to environmental conditions 3 . To pinpoint where these candidate novel proteins might have evolved, we put each through a tBLASTn taxonomic scan of transcriptomes in the NCBI TSA databases, searching for similar sequences in transcriptomes in taxonomic categories with increasing phylogenetic distance from N. flemingeri. Two measures were employed to refine the assessment of novelty: first, the E-value of the best match (top hit) for each search was used as a measure of similarity for that target taxonomic category. Novel proteins were expected to have a high similarity among those taxa possessing them and a much lower similarity for taxa lacking them. Thus, a sharp rise in the similarity (i.e. decrease in E-value) with taxonomic proximity to N. flemingeri was taken as a likely phylogenetic point of emergence for the novel protein. Second, this predicted emergence-point was referenced to a cladogram constructed from the top hits, translated into proteins. The translated proteins were aligned using MAFFT software 35,36 and then manually trimmed to remove non-conserved regions. The cladogram was constructed from these sequences with the software package for phylogenetic analysis using Bayesian Markov chain Monte Carlo methods, MrBayes 3.2 37-39 , with two independent runs of four chains each and 10,000,000 generations (the initial 25% discarded as burn-in) using the WAG substitution model of protein evolution 40 and a gamma distribution of rates. Maximum likelihood bootstrap values were obtained using RAxML 8 with 1,000 bootstrap (2023) 10:242 | https://doi.org/10.1038/s41597-023-02130-1 www.nature.com/scientificdata www.nature.com/scientificdata/ replicates using the WAG substitution model and a gamma distribution of rates 41 . The consensus tree from MrBayes was visualized in FigTree v. 1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/) and the bootstrap values were reported. The cladograms were compared with the known phylogenetic relationships among the component species. The preferred outgroup(s) for each cladogram was taken from the crustacean category (excluding Copepoda) or from the Arthropoda (excluding Crustacea). If neither category produced a satisfactory outgroup, one from a greater taxonomic distance was used. Selected sequences were examined for function-related motifs as a step toward determining possible roles for the non-annotated proteins. Motif searches were performed via three web-based search portals: UCL's PSIPRED and MEMSAT (http://bioinf.cs.ucl.ac.uk/psipred/); Kyoto University Bioinformatics Center's MotifSearch (https://www.genome.jp/tools/motif/), and Swiss Institute of Bioinformatics' MotifScan (https://myhits.sib.swiss/cgi-bin/motif_scan).

technical Validation
Quality control, de novo assembly, mapping and core gene statistic: Sequencing yields, assembly statistics and transcriptome completeness for the eight datasets are summarized in Table 2. The number of cleaned reads retained for the assembly exceeded 90% and ranged from 45 to 73 million ( Table 2). Seven transcriptomes had N50 values that exceeded 1,000 bp. The longest transcript in all seven assemblies was at least 13,000 bp with transcripts >20,000 bp in C. marshallae, N. cristatus and N. plumchrus. The number of Trinity transcripts ranged from 66,838 to 119,923 (Table 2). While seven assemblies were very similar, the statistics for E. bungii suggests that the reads from this species did not assemble as well as the others: there were fewer transcripts and transcript lengths were shorter (N25, N50, N75, maximum length, see Table 2). Self-mapping rate for the assemblies averaged 89% and ranged from 84% to 92% (Table 2). Based on the BUSCO analysis, the number of "complete" orthologous BUSCO "core" genes ranged from 70% to 94%, with only one species (E. bungii) failing to attain at least 85% of "complete" core genes. The transcriptomes had an additional 3% to 14% "fragmented" "core" transcripts ( Table 2).
Functional annotation. The functional annotation of the transcripts using the Swiss-Prot database identified over 30,000 significant BLAST hits for seven of the eight assemblies ( Table 2). The assembly with the lowest number of assembled transcripts (E. bungii) returned fewer than 18,000 hits. The percentage of transcripts with significant Swiss-Prot hits was similar across all eight assemblies, ranging between 44% (M. pacifica, Mp2017) and 49% (N. flemingeri, Nf2019). Most of the transcripts (>95%) with BLAST hits were also annotated with gene ontology (GO) terms with a smaller number annotated with KEGG enzyme commission numbers (EC, Table 2). The distribution of GO terms across major biological processes was similar in the eight transcriptomes (Fig. 2), as expected based on other copepod de novo assemblies 13,52 . While the assembly for E. bungii was less complete, the transcripts were of high quality, and we retained this transcriptome in the downstream analyses.

Species confirmation. COI sequences from the transcriptomes (Supplementary Information SD1)
received confirmatory top species hits with >97% identity for five species even when the assembled transcript was short ( Table 3). The COI sequence of the sixth species, C. marshallae is very similar to that of C. glacialis (>97%), raising questions about whether they are separate species or members of a species complex 53,54 . While very similar, the sequences and other posted C. marshallae complete and partial sequences (e.g. accession numbers GJQX01213347, KX675691 and KP241590), showed consistent differences with respect to C. glacialis sequences, especially in two locations (nucleotides 489 and 627) within the full-length 1,556 bp mtCOI sequence ( Supplementary Information SD1). Thus, while very similar there appeared to be consistent genetic differences between the two species. Our analysis is consistent with a recent phylogenetic analysis using 191 genes 55 . testing for contamination. Heterospecific contamination can occur in de novo assemblies from field collections and must be routinely checked for 33 . A cursory assessment of the 18 S ribosomal RNA sequences suggested some contamination from foreign sources (Supplementary Information SD2). The levels of contamination were extremely low (<1%), however, the level was 5.5% for reads mapping to a C. glacialis sequence present in the M. pacifica transcriptome. This led us to perform an additional analysis to check for cross-contamination between species by mapping the RNA-Seq data to a "species-filter" (see Methods). Bowtie mapping resulted in ca. a million reads from the E. bungii transcriptome mapping to its COI sequence, while none mapped to a human COI sequence or any other species in the species filter. Similarly, with almost half a million M. pacifica reads mapping to its COI sequence, only a total of 9 mapped to the suspected C. marshallae/C. glacialis COI sequences (see Supplementary Information SD1 Part II for filter composition and numerical details). We conclude that there was either no or negligible contamination from these sources with some counts even arising from cross-mapping.

Protein prediction.
TransDecoder found open reading frames (ORFs) of 100 amino acids or longer in more than 78% of the transcripts in all transcriptomes, with 22% to 38% predicted to be full length ( Table 2). The full-length proteins were analyzed further using the OrthoVenn2 clustering algorithms as described in the Methods and summarized in Table 4. www.nature.com/scientificdata www.nature.com/scientificdata/ intraspecific homologs. Within-species homologies among transcriptomes were expected to be high and an indicator of similarity across assemblies. As shown in Table 4, the percentage of the OrthoVenn2 clusters shared between the two C. marshallae transcriptomes approached 70%. A similar percentage was observed between at least two of the three N. flemingeri transcriptomes (Table 4; Supplementary Information SD3M). Interestingly, when partial-protein sequences were included in the OrthoVenn2 uploads, the shared fraction remained similar. Merging multiple transcriptomes with software such as CAP3 can be implemented by a user to produce a more complete reference at the expense of obscuring the hierarchical information provided by Trinity. Mergers also add genetic diversity owing to the use of more than one individual. Interspecific comparisons. As described in the Methods section, the previously-vetted N. flemingeri transcriptome 3 was translated and used as a common reference since it had been used in gene expression studies, allowing identification of genes responsive to changes in environmental conditions. The numbers of complete-protein homolog clusters that were shared between this reference and each of the new transcriptomes ranged from 40% to 70% ( Supplementary Information SD3), with the highest percentage shared between the N. flemingeri reference and its and conspecifics and two congeners, and the lowest percentages with the two more distantly-related taxa (E. bungii and M. pacifica). In the survey of the overall taxonomic extent of homologies among the primary cluster sets from the four multi-species sets of transcriptomes (Table 4), the fractions of clusters shared by all of the members ("overlap clusters") ranged from 7.2% for the set including all 6 species of the order Calanoida, to 35% for those comprising the three species of the Neocalanus genus. The fraction was greater

Neocalanus
Calanus www.nature.com/scientificdata www.nature.com/scientificdata/ the tighter the taxonomic coverage, as would be expected. Thus, while there were the expected differences among the assembled complete proteins correlating with phylogenetic distance, the large number of overlapping clusters indicates that the individual assemblies produced mutually-consistent transcriptomes.   Table 3. Species COI confirmation. Accession numbers for transcripts in each transcriptome that annotated as mitochondrial COI sequences in BLASTx into SwissProt (see Table 1 for transcriptome codes). Last three columns are values returned by the top hit in a BLASTn of the sequence in Column 2 into NCBI nr/nt database: the species ID, the % identity and the accession number. E-values for all BLASTs were 0.0 except for GJRF01071121 ( = 3E-172). Note that the top hits for 4 of the C. marshallae sequences were identified as the closely-related C. glacialis (the representation of C. marshallae sequences in nt/nr is small). Two of these, indicated with *, while ranking lower in total scores, had C. marshallae hits with 100% identity and E-scores < e-100.

Usage Notes
Reference transcriptomes for ecophysiology. The de novo transcriptomes were generated from individuals belonging to six key zooplankton species of the sub-arctic waters of the Gulf of Alaska: Neocalanus flemingeri, N. plumchrus, N. cristatus, Calanus marshallae, Eucalanus bungii and Metridia pacifica. The assemblies and their functional annotation provide references for future RNA-Seq studies focused either on single or multiple species, as well as in community studies based on metatranscriptomics 17 . The predicted sequences can be used to design species-specific primers for PCR-based studies focused on biomarkers 56 . Homologous genes identified across transcriptomes from these six co-occurring species can be used to generate a set of references for interspecies comparisons. While there are more studies that are generating RNA-Seq data for bulk samples of planktonic organisms, interspecies comparisons, especially for zooplankton, continue to be challenging given the absence of robust community-specific references. By focusing on key species from one region, these de novo transcriptomes open opportunities to examine species-specific responses and niche separation under natural and/or experimental conditions. Resource for protein discovery. De novo transcriptomes are composed of predicted expressed transcripts assembled from short sequences. These transcripts are then translated into predicted proteins, which may or may not be real. In spite of this limitation, transcriptomes have been rich data sources for protein discovery [57][58][59][60][61] . Here, we illustrate how non-annotated transcripts with predicted homologous proteins in multiple transcriptomes might be used in the search for protein discovery (Table 5). In the single-species cluster sets for C. marshallae and N. flemingeri, around a third failed annotation through lack of significant BLAST hits in SwissProt. In the multi-species sets, the 6-calanoid primary set as well as the three secondary sets derived from it (see Supplementary Information ST1 for cluster compositions), the percentages of non-annotated clusters ranged from 15% to 20%. After removing clusters with membership outside of the selected secondary taxonomic category, as described in Methods, only a few hundred prime candidates remained, inviting further investigation ( Table 5, "non-annotated taxonomically homogeneous" row).
Environmentally-responsive genes among the non-annotated predicted proteins. We compared the candidate genes to lists of environmentally-responsive transcripts of pre-adult N. flemingeri (DEGs) reported from a spatial transect in the Gulf of Alaska and an interannual comparison (2015-2017) within Prince William Sound 3,5 . Among the DEGs were 101 and 83 non-annotated transcripts/clusters along the spatial gradient and the interannual comparison, respectively (Table 5). Thirty transcripts were differentially-expressed under both conditions ( Supplementary Information ST2), highlighting the genes' likely importance in the ecophysiology of this species. The availability of homologs in the other transcriptomes not only means that these genes are "real", but also raises the question whether they may be eco-responsive in the co-occurring species under similar conditions.
Comparisons with a rapidly-evolving gene. We took a further step toward testing ways of validating the novelty of these unannotated clusters of homologous environmentally-responsive predicted proteins by searching databases for similar proteins at different phylogenetic distances from the cluster. In order to provide some context, we first characterized the similarity profile of an annotated transcript from the glutathione  www.nature.com/scientificdata www.nature.com/scientificdata/ Intra-specific (primary) 1 Secondary subsets 1 Primary 1  Table 5. OrthoVenn2 results obtained when the predicted proteins of a primary set of homologous clusters is subsampled to generate "secondary" subsets. Those clusters shared by all of the transcriptomes in the subset are enumerated in the "all" rows in the table (proteins irrespective of annotation status, and separately, just those not annotated in SwissProt by OrthoVenn2: see Methods). Numbers in parentheses are repeats of the same categories as in Table 4.Those sets are still heterogeneous with respect to taxonomic coverage. The taxonomically homogeneous non-annotated "all" category of the secondary subsets is generated by purging clusters that include "foreign" species, so only clusters restricted to the transcriptomes listed remain. Note that owing to the different procedures for generating "primary" vs "secondary" sets, the "all" categories in the multispecies sets differ somewhat from the corresponding ones of   www.nature.com/scientificdata www.nature.com/scientificdata/ Where possible, the outgroup used to root the tree was either the crustacean homolog (excluding Copepoda) or the arthropod homolog (excluding Crustacea). In (b), no adequate homolog was identified in either outgroup category, so a more distant outgroup was used (a sponge). In Panels a and b, hits were missing for certain taxonomic categories: Harpacticoida and Cyclopoida for (a) and Augaptiloidea and Arthropoda for (b). In some cases, the top-hit sequence was for a partial protein that was too short to be included in a reliable cladogram, in which case the next most similar hit was used. Accession numbers for the N. flemingeri query sequence are given above each bar graph. Species  www.nature.com/scientificdata www.nature.com/scientificdata/ S-transferases (GST) family as a reference. The GSTs belong to a gene family known to be rapidly evolving 62 . Using the BLAST-scan algorithm described in Methods, we took the omega GST from N. flemingeri as a query to identify the most similar homologous transcripts in calanoid and other copepods, crustaceans and arthropods. The bar graph in Fig. 3a shows that the similarity [ = -log 10 (E-value)] declines gradually as a function of phylogenetic distance from the N. flemingeri query. The cladogram in Fig. 3b, based on the same hits, shows nested clades with moderately-well-supported branching (bootstrap values >70) that includes homologs from all the calanoids as well as one from the harpacticoid, Tigriopus. Amino-acid substitution rates (branch lengths) between clades nested within this large clade are modest (~0.2-0.3). The cladogram largely parallels the evolutionary relationships among the Copepoda, especially the calanoid families, as would be expected for a broadly-distributed protein that changed in concert with the species' evolution 6,22,23 .
Environmentally-responsive predicted proteins of N. flemingeri that lack annotations can be compared with the GST pattern. As examples, taxonomic BLAST-scans into NCBI (TSA) of three such sequences are presented in Fig. 4. In contrast to the omega GST pattern, the predicted proteins were highly similar within a narrow taxonomic range, with a sharp decline in similarity [-log 10 (E-value)] outside of the group (Fig. 4 left panels, set 1). The high-similarity group for Fig. 4a1 included only hits from the order Calanoida. Significantly, the BLAST www.nature.com/scientificdata www.nature.com/scientificdata/ got no hits among the Harpacticoida or Cyclopoida, giving further emphasis to the uniqueness of that protein group. The high-similarity groups in Figure 4b1, c1 were even more restricted, being confined solely to calanid proteins. Most of the nodes leading to the N. flemingeri query sequence and its cluster of homologs received bootstrap support similar to that of the GST (>80). The support was particularly strong (100) for the clades that included the high-similarity (calanid) taxa in Figure 4b2, c2. For Fig. 4a2, while the branch pattern in the cladogram produced by MrBayes grouped the calanoids appropriately, bootstrap support was not strong. In all three cladogram examples, substitution rates (branch lengths) between the high-similarity taxa and those outside the group were notably longer than for the GST (>0.5). Thus both the sharp rises in the BLAST-scan similarity with taxonomic proximity to N. flemingeri and the long branch lengths for the corresponding points in the cladograms are consistent with a phylogenetic point of emergence of a novel protein within the Copepoda.
Functional motifs of candidate novel proteins. The presence of functional motifs -sequences of amino acids that have been shown in other studies to endow proteins with known properties -can provide some insights into function. All three putative novel proteins from Fig. 4 contained functional motifs as shown in Fig. 5. The protein of Fig. 5a appears to be a member the pacifastin family, as it contains the typical repeated 6-cysteine-residue motifs (Fig. 5a, violet highlight). The pacifastins are arthropod serine peptidase inhibitors that have been shown to have a role in the immune response 63 . This copepod sequence features four inhibitor motifs. Interestingly, the malacostracan (amphipod) homolog possesses many more (17; Supplementary  Information SD4). The predicted protein of Fig. 5b includes a channel-pore-lining motif (green highlight), suggesting a role as a trans-membrane ion channel. Finally, in the protein of Fig. 5c, the N-terminal portion contains an ankyrin repeat motif (red highlights), which is highly conserved throughout the Arthropoda. A comparison across homologous sequences showed that the predicted proteins of the five myelinate species possess an extensive C-terminal segment that is missing from the amyelinates and more distantly-related taxa. This is shown diagrammatically in Fig. 5d which separates the conserved portion of the protein (red) from the C-terminal extension (blue) (see SD4 for all sequences).

Code availability
Parameters to software tools involved are described below: FastQC